Abstract



Personal data has value to both its owner and to institutions who would like to analyze 
it. Privacy mechanisms protect the owner's data while releasing to analysts noisy versions of 
aggregate query results. But such strict protections of individuals' data have not yet found wide
use in practice. Instead, Internet companies, for example, commonly provide free services in 
return for valuable sensitive information from users, which they exploit and sometimes sell to 
third parties. 

As awareness of the value of personal data increases, so does the drive to compensate
the end user for her private information. The idea of monetizing private data can improve over
the narrower view of hiding private data, since it empowers individuals to control their data 
through financial means. 

In this paper we propose a theoretical framework for assigning prices to noisy query answers, 
as a function of their accuracy, and for dividing the price amongst data owners who deserve 
compensation for their loss of privacy. Our framework adopts and extends key principles from 
both differential privacy and query pricing in data markets. We identify essential properties of 
the price function and micro-payments, and characterize valid solutions. 



1 Introduction 



Personal data has value to both its owner and to institutions who would like to analyze it. The
interests of individuals and institutions with respect to personal data are often at odds, and a rich
literature on privacy-preserving data publishing techniques [H] has tried to devise technical methods
for negotiating these competing interests. Broadly construed, privacy refers to an individual's
right to control how her private data will be used, and was originally phrased as an individual's right
to be protected against gossip and slander [8]. Research on privacy-preserving data publishing has
focused more narrowly on privacy as data confidentiality. For example, in perturbation-based data
privacy, the goal is to prevent access to an individual's personal data, while allowing legitimate
users access to aggregate computations over a large population [TT].

To date, this goal has remained elusive. For example, one important result from that line of
work is that any privacy-preserving mechanism must strictly limit the number of queries that can
be accurately asked over a collection of private data [10], thus imposing a strict privacy budget for
any legitimate user of the data [25]. Researchers are actively investigating formal notions of privacy
and their implications for accurate data analysis. Yet, with rare exceptions [18], perturbation-based
privacy mechanisms have not been used in practice.

Instead, many Internet companies have followed a simple formula to acquire personal data. 
They offer a free service, attract users who provide their data, and then monetize the personal 
data by selling it, or by selling information derived from it, to third parties. For example, a recent 
article in the New York Times [6] mentions a study that found that a unique user is worth $4 to 
Facebook and $24 to Google. 

Up to now, most users have been willing to provide their personal data in return for online services.
That is, access to an otherwise free service, sometimes combined with a lack of understanding of how
their data will be used, is sufficient to persuade users to contribute their personal data. But as
individuals become increasingly aware of the value of their personal data, and the potential
consequences of disclosing it, there has been a drive to compensate them directly for their data [27].
In fact, startup companies are currently developing infrastructure to support this trend. For example,
www.personal.com creates personal data vaults, each of which may contain thousands of data points
about its users. Businesses pay for this data, and the data owners are appropriately compensated,
financially or otherwise.

Monetizing private data is an improvement over the narrow view of privacy as data confidentiality
because it empowers individuals to control their data through financial means. In this paper
we propose a framework for assigning prices to queries in order to compensate the data owners for
their loss of privacy. Our framework borrows from, and extends, key principles from both differential
privacy [11] and data markets [20, 22]. There are three actors in our setting: individuals, or
data owners, contribute their personal data; a buyer submits an aggregate query over many owners'
data; and a market maker, trusted to answer queries on behalf of owners, charges the buyer and
compensates the owners. Our framework makes three important connections:

Perturbation and Price. In response to a buyer's query, the market maker computes the true
query answer, adds random noise, and returns a perturbed result. Each amount of perturbation
has a price: the smaller the perturbation, the higher the price. The buyer specifies how
much or little perturbation he is willing to purchase when issuing the query. At one extreme,
he can purchase the unperturbed data at a high price. At the other extreme, he can ask a
query almost for free, but the noise added might be the same as in differential privacy [11]
with conservative privacy parameters. The relationship between a query's accuracy and its
cost depends on the query and the preferences of contributing data owners. Formalizing this
relationship is one of the central goals of this paper.



Arbitrage and Perturbation. Arbitrage is an undesirable property of a set of priced queries
that allows a buyer to obtain the answer to a query more cheaply than its advertised price
by deriving the answer from a less expensive alternative set of queries. As a simple example,
suppose that a given query is sold with two options for perturbation, measured by variance:
a variance of 10 for $5 and a variance of 1 for $200. A savvy buyer who seeks a variance of 1
would never pay $200: instead, he would purchase the first query 10 times, receive 10 noisy
answers, and compute their average. Since the noise is added independently, the variance
of the resulting average is 1, and the total cost is only $50. Arbitrage opportunities result
from inconsistencies in the pricing of queries, which must be avoided, and perturbing query
answers makes this significantly more challenging. Avoiding arbitrage in data markets has
been considered before only in the absence of perturbation [3, 20, 22]. Formalizing arbitrage
for noisy queries is a second central goal of this paper.

Privacy-loss and Payments. Given a randomized mechanism for answering a query q, a common
measure of privacy loss to an individual is defined by differential privacy: it is the maximum
ratio between the probability of returning some fixed output with and without that individual's
data. Differential privacy imposes a bound of $e^{\varepsilon}$ on this quantity, where $\varepsilon$ is a small
constant, presumed acceptable to all individuals in the population. Our setting contrasts
with this in several ways. First, the privacy loss is not limited a priori, but depends on the
buyer's request. If the buyer asks for a query with low variance, then the privacy loss to (at
least some) individuals will be high. In our framework, these data owners must be compensated
for their privacy loss through the buyer's payment. At an extreme, if the query is exact
(unperturbed), then the privacy loss to some individuals is total, and they will be compensated
appropriately. Also, we allow each data owner to value his or her privacy loss separately, by
demanding greater or lesser payments. Formalizing the relationship between the privacy loss
and the payments to the data owners is a third central goal of this paper.
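The averaging arbitrage described under "Arbitrage and Perturbation" can be checked numerically. The following is a minimal sketch, not part of the original paper: the function names are hypothetical, and Gaussian noise is assumed only for the simulation (the argument itself needs only independent, zero-mean noise).

```python
import random
import statistics


def averaged_variance(v: float, n: int) -> float:
    """Variance of the mean of n independent answers, each with variance v."""
    return v / n


def total_cost(price: float, n: int) -> float:
    """Cost of buying the same priced query n times."""
    return price * n


def simulate_average(true_answer: float, v: float, n: int, trials: int, seed: int = 0) -> float:
    """Empirical variance of the average of n noisy answers (Gaussian noise is an assumption)."""
    rng = random.Random(seed)
    sd = v ** 0.5
    means = [
        sum(true_answer + rng.gauss(0.0, sd) for _ in range(n)) / n
        for _ in range(trials)
    ]
    return statistics.pvariance(means)


# The example from the text: variance 10 for $5 versus variance 1 for $200.
print(averaged_variance(10.0, 10))  # 1.0
print(total_cost(5.0, 10))          # 50.0
print(simulate_average(42.0, 10.0, 10, trials=20000))  # close to 1
```

Buying the cheap option ten times reaches the variance of the expensive option at a quarter of its price, which is exactly the inconsistency an arbitrage-free price function has to rule out.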

By charging buyers for access to private data we overcome a fundamental limitation of
perturbation-based privacy-preserving mechanisms, namely the privacy budget. This term refers to a limit on
the quantity and/or accuracy of queries that any buyer can ask, in order to prevent an unacceptable
disclosure of the data. For example, if a differentially-private mechanism adds Laplacian noise
with variance $v$, then by asking the same query $n$ times the buyer can reduce the variance to $v/n$.
Even if queries are restricted to aggregate queries over multiple individuals, there exist sequences
of queries that can reveal the private data for most individuals in the database [10], and enforcing
the privacy budget must prevent this. In contrast, when private data is priced, full disclosure is
possible only if the buyer pays a high price. For example, in order to reduce the variance to $v/n$,
the buyer would have to purchase the query $n$ times, thus paying $n$ times more than for a single
query. In order to perform the attacks in [10] he would have to purchase (roughly) $n \log^2 n$ queries
and pay for all of them.

Thus the burden of the market maker is no longer to guard the privacy budget, but instead to 
ensure that prices are set such that, whatever disclosure is obtained by the buyer, all contributing 
individuals are properly compensated. In particular, if a sequence of queries can indeed reveal the 
private data for most individuals, its price must approach the total cost for the entire database. 

The paper is organized as follows. We describe the basic framework for pricing private data in
Sect. 2. In Sect. 3 we discuss the main required properties for price functions, developing notions
of answerability for perturbed query answers and characterizing arbitrage-free price functions. In
Sect. 4 we develop a notion of personalized privacy loss for individuals, based on differential privacy.
We define micro-payment functions using this measure of privacy loss in Sect. 5. We discuss two
future challenges for pricing private data in Sect. 7: disclosures that could result from an individual's
privacy valuations alone, and incentives for data owners to honestly reveal the valuations of their
data. We discuss related work and conclude in Sect. 8 and Sect. 9.



2 Basic Concepts 

In this section we describe the basic architecture of the private data pricing framework, illustrated
in Fig. 1.

Figure 1: The pricing framework has three major components: (1) the interaction between the
buyer and the market maker: the buyer asks a query $Q = (\mathbf{q}, v)$ and must pay its price, $\pi(Q)$;
(2) the privacy loss: by answering $Q$, the market maker leaks some information $\varepsilon_i$ about the
private data from the data owners to the buyer; (3) the interaction between the market maker and
the data owner: the market maker must compensate each data owner for her privacy loss with a
micro-payment $\mu_i(Q)$. The pricing framework is balanced if the price $\pi(Q)$ is sufficient to cover all
micro-payments $\mu_i$ and if each micro-payment $\mu_i$ compensates the owner for her privacy loss $\varepsilon_i$.



2.1 The Main Actors 

The Market Maker. The market maker is trusted by the buyer and by each of the data owners. 
The market maker collects data from the owners and sells the data in the form of queries. When the 
buyer decides to purchase a query, the market maker collects payment from the buyer, computes 
the answer to the query, adds noise as appropriate, returns the result to the buyer, and finally 
distributes individual payments to the data owners. The market maker may retain a fraction of 
the purchase query price as his own profit. 

The Owner and Her Data. Our data model is similar to that used in [32], where the data
items are called data elements; here we use the two terms interchangeably.

Definition 1 (Database). A database is a vector of real-valued data items $\mathbf{x} = (x_1, x_2, \ldots, x_n)$.

Each data item $x_i$ represents personal information, owned by some individual. In this paper
we restrict the discussion to numerical data. For example, $x_i$ may represent an individual's rating
of a new product with a numerical value from $x_i = 0$ meaning poor to $x_i = 5$ meaning excellent;
or it may represent the HIV status of a patient in a hospital, $x_i = 0$ meaning negative, and $x_i = 1$
meaning positive. Or, $x_i$ may represent the age; or the annual income; etc. Importantly, each data
item $x_i$ is owned by an individual but an individual may own several data items; for example, if
we have a table with attributes age, gender, marital-status, then items $x_1, x_2, x_3$ belong to the
first individual, items $x_4, x_5, x_6$ to the second individual, etc.

The Buyer and His Queries. The buyer is a data analyst who computes some queries over 
the data. We restrict our attention to the class of linear aggregation queries over the data items in 
x. 

Definition 2 (Linear Query). A linear query is a real-valued vector $\mathbf{q} = (q_1, q_2, \ldots, q_n)$. The answer
$\mathbf{q}(\mathbf{x})$ to a linear query on $\mathbf{x}$ is the vector product $\mathbf{q}\mathbf{x} = q_1 x_1 + \cdots + q_n x_n$.
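As a small illustration of Definition 2, the exact answer to a linear query is a dot product between the query vector and the database. The sketch below uses a hypothetical helper name (`answer`) and made-up data; it is not taken from the paper.

```python
def answer(q, x):
    """Exact answer q(x) = q_1*x_1 + ... + q_n*x_n to a linear query q on database x."""
    if len(q) != len(x):
        raise ValueError("query and database must have the same length")
    return sum(qi * xi for qi, xi in zip(q, x))


x = [3, 5, 1, 4]           # hypothetical database with four data items
sum_all = [1, 1, 1, 1]     # aggregate over every data item
first_only = [1, 0, 0, 0]  # touches a single data item

print(answer(sum_all, x))     # 13
print(answer(first_only, x))  # 3
```

The second query shows why the class of linear queries is sensitive: a query vector with a single non-zero entry returns one owner's data item directly.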

Importantly, we assume that the buyer is allowed to issue multiple queries. This means the 
buyer can combine information derived from multiple queries to infer answers to other queries not 
explicitly requested. This presents a challenge we must address: to ensure that the buyer pays for 
any information that he might derive directly or indirectly. 

We give two examples that illustrate our data model and the main actors. 

Example 3. Imagine a competition between candidates A and B that is decided by a population of
voters who each rate the competitors. The data domain {0, 1, 2, 3, 4, 5} represents numerical ratings.
In our data model, $x_1, x_2$ represent the ratings given by Voter 1 to candidates A and B respectively;
$x_3, x_4$ are Voter 2's ratings of A and B respectively, and so on. The names of the voters are public,
but their ratings are sensitive and should be compensated properly if used in any way. If the buyer
considers Voter 1 and Voter 2 experts compared with the other voters he might give a higher weight to
the ratings of Voter 1 and Voter 2. When a buyer wants to calculate the total rating for candidate
A, he would issue the following linear query $\mathbf{q}_1 = (w_1, 0, w_1, 0, w_2, 0, w_2, 0, w_2, 0, \ldots, w_2, 0)$ with
$w_1 > w_2 > 0$.

Example 4. In economics, linear models are commonly used for prediction. Imagine that there
are three attributes of a company that jointly determine its revenue. In a linear model, revenue =
$w_1 x_1 + w_2 x_2 + w_3 x_3$, where $x_1, x_2, x_3$ are the data items for a company and $w_1, w_2, w_3$ are the
coefficients. If the database contains $k$ companies, then $n = 3k$, and $x_1, x_2, x_3$ represent the data of
Company 1, $x_4, x_5, x_6$ represent the data of Company 2, etc. If a buyer wants to predict the average
revenue of these $k$ companies, then he may issue the following linear query
$\mathbf{q}_2 = (w_1/k, w_2/k, w_3/k, w_1/k, w_2/k, w_3/k, \ldots)$.
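The query of Example 4 can be built mechanically from the model coefficients. The following sketch is illustrative only: the helper names and the numbers are hypothetical, chosen so the result is easy to verify by hand.

```python
def average_revenue_query(weights, k):
    """Query q2 of Example 4: each coefficient divided by k, repeated once per company."""
    return [w / k for w in weights] * k


def answer(q, x):
    """Exact answer to the linear query q on database x."""
    return sum(qi * xi for qi, xi in zip(q, x))


weights = [2.0, 4.0, 6.0]      # hypothetical coefficients w1, w2, w3
x = [1, 1, 1, 2, 2, 2]         # two companies, three attributes each
q2 = average_revenue_query(weights, k=2)

print(q2)             # [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
print(answer(q2, x))  # 18.0, the mean of revenues 12 and 24
```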



2.2 Balanced Pricing Framework 



The pricing framework is balanced if (1) each data owner is appropriately compensated whenever
the answer to some query results in some privacy loss of her data item $x_i$, and (2) the buyer is
charged sufficiently to cover these payments. This definition involves three quantities:
the payment $\pi$ that the buyer needs to pay the market maker (Sect. 3), a measure $\varepsilon_i$ of the privacy
loss of data item $x_i$ (Sect. 4), and a micro-payment $\mu_i$ by which the market maker compensates the
owner of $x_i$ for this privacy loss (Sect. 5).



The buyer is allowed to specify, in addition to a linear query $\mathbf{q}$, an amount of noise $v$ that he is
willing to tolerate in the answer to the query; the buyer's query is a pair $Q = (\mathbf{q}, v)$, where $\mathbf{q}$ is a
linear query and $v \ge 0$ represents an upper bound on the variance. Thus, the price depends both
on $\mathbf{q}$ and $v$: $\pi(Q) = \pi(\mathbf{q}, v) \ge 0$. The market maker answers by first computing the exact answer
$\mathbf{q}(\mathbf{x})$, then adding noise with mean 0 and variance at most $v$. This feature gives the buyer more
pricing options, because by increasing $v$ he can lower his price.

Having received the purchase price for a query $Q$, the market maker then distributes it to
the data owners: the owner of data item $x_i$ receives a micro-payment $\mu_i(Q) \ge 0$. If the same
owner contributes multiple data items $x_i, x_{i+1}, \ldots$ then she is compensated for each. We discuss
micro-payments in Sect. 5.

Finally, the micro-payment must compensate the data owner for her privacy loss: we require that
$\mu_i(Q)$ compensates the owner of $x_i$ for the privacy loss $\varepsilon_i$.

We say that the pricing framework defined by $\pi$, $\varepsilon_i$ and $\mu_i$ is balanced if (1) the payment received
from the buyer always covers the micro-payments made to data owners, that is $\sum_{i=1}^{n} \mu_i(Q) \le \pi(Q)$,
and (2) each micro-payment $\mu_i$ compensates the owner of the data item $x_i$ according to the privacy
loss $\varepsilon_i$, as specified by some contract between the data owner and the market maker. We discuss
balanced pricing frameworks and give a general procedure for designing them in Sect. 6.



3 Pricing Queries 



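In this section we describe the first component of the pricing framework in Fig. 1: the pricing
function $\pi(Q) = \pi(\mathbf{q}, v)$. We denote $\mathbb{R}^+ = [0, \infty)$ and $\bar{\mathbb{R}}^+ = \mathbb{R}^+ \cup \{\infty\}$.

Definition 5. A price function is $\pi : \mathbb{R}^n \times \bar{\mathbb{R}}^+ \to \bar{\mathbb{R}}^+$.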

In our framework, the buyer is allowed to issue multiple queries. As a consequence, an important 
concern is that the buyer may combine answers from multiple queries and derive an answer to a new 
query, without paying the full price for the latter: we call such a situation arbitrage. A reasonable 
pricing function must guarantee that no arbitrage is possible, in which case we call it arbitrage-free. 
Such a pricing function ensures that the market maker receives proper payment for each query by 
removing any incentive for the buyer to game the system by asking a set of cheaper queries in order 
to obtain the desired answer. In this section we formally define arbitrage-free pricing functions,
study their properties, and describe a general framework for constructing arbitrage-free pricing
functions, which we will later reuse in Sect. 5 to define micro-payments and obtain a balanced
pricing framework.

3.1 Queries and Answers 

The market maker uses a randomized mechanism for answering queries. Given a buyer's query $Q =
(\mathbf{q}, v)$, the mechanism defines a random function $\mathcal{K}_Q(\mathbf{x})$, such that, for any $\mathbf{x}$, $E(\mathcal{K}_Q(\mathbf{x})) = \mathbf{q}(\mathbf{x})$
and $\mathrm{Var}(\mathcal{K}_Q(\mathbf{x})) \le v$. The market maker samples one value from this distribution and returns it
to the buyer in exchange for payment $\pi(Q)$. We abbreviate $\mathcal{K}_Q$ with $\mathcal{K}$ when $Q$ is clear from the
context.

Definition 6. We say that a randomized algorithm $\mathcal{K}(\mathbf{x})$ answers the query $Q = (\mathbf{q}, v)$ on the
database $\mathbf{x}$ if its expectation is $\mathbf{q}(\mathbf{x})$ and its variance is less than or equal to $v$.



Note that in Def. 6 answerability is defined only in terms of the expectation and variance of answers,
and that no specific noise distribution is required. Thus, there is an inherent assumption in the
definition that the buyer only cares about the variance and is indifferent to other properties of the
distribution, which can be determined later for specific applications.
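To make Def. 6 concrete, the sketch below shows one possible mechanism that answers a query (q, v). Since the text leaves the noise distribution open, the choice of Gaussian noise here is purely an assumption for illustration, and `gaussian_mechanism` is a hypothetical name.

```python
import random
import statistics


def gaussian_mechanism(q, x, v, rng=random):
    """Return q(x) plus zero-mean noise of variance v.

    Gaussian noise is an assumption; Def. 6 only constrains mean and variance.
    """
    exact = sum(qi * xi for qi, xi in zip(q, x))
    if v == 0:
        return exact  # zero variance means the unperturbed answer
    return exact + rng.gauss(0.0, v ** 0.5)


rng = random.Random(1)
samples = [gaussian_mechanism([1, 1], [2, 3], 4.0, rng) for _ in range(20000)]
print(statistics.mean(samples))       # close to the exact answer 5
print(statistics.pvariance(samples))  # close to the requested variance 4
```

Each call draws fresh noise, which matches the stateless market maker described in this section: repeated purchases of the same query yield independent samples.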

Notice that, strictly speaking, nothing stops the market maker from returning a very accurate
answer, much more accurate than the buyer has requested. However, by doing so, he will disclose
more personal information of the data owners and will therefore need to compensate them. The
compensating micro-payments, in turn, can only be recovered from the payment from the buyer,
$\pi(Q)$. Hence, the market maker has no incentive to lower the variance below that requested by
the buyer.

We assume that the market maker is stateless: he does not keep a log of past queries, of past 
users, or of their answers. As a consequence, each query is answered using an independent random 
variable. If the same buyer issues the same query repeatedly, the market maker answers using
independent samples from the random variable $\mathcal{K}$. Of course, the buyer would have to pay for each
query separately.

3.2 Answerability and Determinacy 

Before investigating arbitrage we establish the key concept of query answerability. This notion is
well studied for deterministic queries and views [16, 26], but, in our setting, the queries are random
variables, and it requires a precise definition. Our definition below directly extends the traditional
definition from deterministic to randomized queries.

Definition 7 (Answerability). We say that a query $Q$ is answerable from a multi-set of queries
$S = \{Q_1, \ldots, Q_k\}$ if there exists a function $f : \mathbb{R}^k \to \mathbb{R}$ such that, for any mechanisms $\mathcal{K}_1, \ldots, \mathcal{K}_k$
that answer the queries $Q_1, \ldots, Q_k$, the composite mechanism $f(\mathcal{K}_1, \ldots, \mathcal{K}_k)$ answers the query $Q$.
We say that $Q$ is linearly answerable from $Q_1, \ldots, Q_k$ if the function $f$ is linear.

For a simple example, consider queries $Q_1 = (\mathbf{q}_1, v_1)$ and $Q_2 = (\mathbf{q}_2, v_2)$ and mechanisms $\mathcal{K}_1$
and $\mathcal{K}_2$ that answer them. The query $Q_3 = ((\mathbf{q}_1 + \mathbf{q}_2)/2, (v_1 + v_2)/4)$ is answerable from $Q_1$
and $Q_2$ because we can simply sum and scale the answers returned by the two mechanisms:
$E((\mathcal{K}_1 + \mathcal{K}_2)/2) = (E(\mathcal{K}_1) + E(\mathcal{K}_2))/2$ and, since the two answers are independent,
$\mathrm{Var}((\mathcal{K}_1 + \mathcal{K}_2)/2) = (\mathrm{Var}(\mathcal{K}_1) + \mathrm{Var}(\mathcal{K}_2))/4$. Since
the function is linear, we say that the query is linearly answerable.
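The bookkeeping in this example can be written down directly. The following sketch (with a hypothetical helper name) computes the query and variance bound that the mechanism (K1 + K2)/2 answers, given the two purchased queries; it assumes the two answers are independent, as they are for a stateless market maker.

```python
def combine_half_sum(Q1, Q2):
    """Query (q3, v3) answered by (K1 + K2)/2 when K1 answers Q1 and K2 answers Q2."""
    (q1, v1), (q2, v2) = Q1, Q2
    q3 = [(a + b) / 2 for a, b in zip(q1, q2)]
    v3 = (v1 + v2) / 4  # Var((K1 + K2)/2) = (Var(K1) + Var(K2)) / 4
    return q3, v3


Q1 = ([1.0, 0.0], 2.0)
Q2 = ([0.0, 1.0], 6.0)
print(combine_half_sum(Q1, Q2))  # ([0.5, 0.5], 2.0)
```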

How do we check if a query can be answered from a given set of queries? In this paper we give 
a partial answer, by characterizing when a query is linearly answerable. 

Definition 8 (Determinacy). The determinacy relation is a relation between a multi-set of queries
$S = \{Q_1, \ldots, Q_k\}$ and a query $Q$, denoted $S \to Q$, and defined by the following rules:

Summation $\{(\mathbf{q}_1, v_1), \ldots, (\mathbf{q}_k, v_k)\} \to (\mathbf{q}_1 + \cdots + \mathbf{q}_k, v_1 + \cdots + v_k)$;

Scalar multiplication $\forall c \in \mathbb{R}$, $(\mathbf{q}, v) \to (c\mathbf{q}, c^2 v)$;

Relaxation $(\mathbf{q}, v) \to (\mathbf{q}, v')$, where $v \le v'$;

Transitivity If $S_1 \to Q_1, \ldots, S_k \to Q_k$ and $\{Q_1, \ldots, Q_k\} \to Q$, then $\bigcup_{i=1}^{k} S_i \to Q$.

The following proposition gives a characterization of linear answerability: 

Proposition 9. Let $S = \{(\mathbf{q}_1, v_1), \ldots, (\mathbf{q}_m, v_m)\}$ be a multi-set of queries, and $Q = (\mathbf{q}, v)$ be a
query. Then the following conditions are equivalent.

1. $Q$ is linearly answerable from $S$.

2. $S \to Q$.

3. There exist $c_1, \ldots, c_m$ such that $c_1 \mathbf{q}_1 + \cdots + c_m \mathbf{q}_m = \mathbf{q}$ and $c_1^2 v_1 + \cdots + c_m^2 v_m \le v$.

Proof. (1 $\Leftrightarrow$ 3): Follows from the definition of linear answerability.

(2 $\Rightarrow$ 3): It is clear that in the rules of the determinacy relation, summation, scalar multiplication
and relaxation are special cases of 3. For the transitivity rule, for each $i = 1, \ldots, k$, let $f_i$ be a
linear function such that $f_i(S_i) = \mathbf{q}_i$ with variance no more than $v_i$. Let $f$ be a linear function
such that $f(\mathbf{q}_1, \ldots, \mathbf{q}_k) = \mathbf{q}$ with variance no more than $v$. Then the composition $f(f_1(S_1), \ldots, f_k(S_k))$ is a
linear function of $\bigcup_{i=1}^{k} S_i$ and the variance introduced is no more than $v$.

(3 $\Rightarrow$ 2): Since $(\mathbf{q}_i, v_i) \to (c_i \mathbf{q}_i, c_i^2 v_i)$,
$\{(c_1 \mathbf{q}_1, c_1^2 v_1), \ldots, (c_m \mathbf{q}_m, c_m^2 v_m)\} \to (c_1 \mathbf{q}_1 + \cdots + c_m \mathbf{q}_m, c_1^2 v_1 + \cdots + c_m^2 v_m) = (\mathbf{q}, c_1^2 v_1 + \cdots + c_m^2 v_m)$,
and $(\mathbf{q}, c_1^2 v_1 + \cdots + c_m^2 v_m) \to (\mathbf{q}, v)$, we obtain $S \to Q$. □
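Condition 3 of Prop. 9 gives a certificate for determinacy that is easy to verify once the coefficients are known. The sketch below checks such a certificate; the function name and the floating-point tolerance are assumptions made for illustration, and it does not search for the coefficients.

```python
def certifies_determinacy(S, Q, c, tol=1e-9):
    """Check condition 3 of Prop. 9: sum c_i*q_i == q and sum c_i^2*v_i <= v."""
    q, v = Q
    if len(c) != len(S):
        raise ValueError("need exactly one coefficient per query in S")
    combo = [sum(ci * qi[j] for ci, (qi, _) in zip(c, S)) for j in range(len(q))]
    variance = sum(ci * ci * vi for ci, (_, vi) in zip(c, S))
    same_query = all(abs(a - b) <= tol for a, b in zip(combo, q))
    return same_query and variance <= v + tol


# Ten copies of (q, 10) averaged with coefficients 1/10 determine (q, 1).
q = [1.0, 2.0]
S = [(q, 10.0)] * 10
c = [0.1] * 10
print(certifies_determinacy(S, (q, 1.0), c))  # True
print(certifies_determinacy(S, (q, 0.5), c))  # False: variance 1 exceeds 0.5
```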

Thus, determinacy fully characterizes linear answerability. But it cannot characterize general
answerability. Recall that we do not specify a noise distribution in the definition of a query
answering mechanism. If the query answering mechanism does not use Gaussian noise, then
non-linear composition functions may play an important role in query answering. This follows from the
existence of an unbiased non-linear estimator whose variance is smaller than that of linear estimators [19]
when the noise distribution is not Gaussian.

In this paper we restrict our discussion to linear answerability; in other words, we assume that
the buyer will attempt to derive new answers from existing queries only by computing linear
combinations. By Prop. 9, we will use the determinacy relation $S \to Q$ instead of linear answerability.

Deciding determinacy, $S \to Q$, can be done in polynomial time using a quadratic program. The
program first determines whether $\mathbf{q}$ can be represented as a linear combination of queries in $S$. If
the answer is yes, the quadratic program further checks whether there is a linear combination
that answers $\mathbf{q}$ with variance at most $v$.

Proposition 10. Verifying whether a set $S$ of $m$ queries determines a query $(\mathbf{q}, v)$ can be done in
PTIME(m, n).

Proof. Given a set $S = \{(\mathbf{q}_1, v_1), \ldots, (\mathbf{q}_m, v_m)\}$ and a query $(\mathbf{q}, v)$, the following quadratic program
outputs the minimum possible variance to answer $\mathbf{q}$ using linear combinations of queries in $S$.

Given: $\mathbf{q}, \mathbf{q}_1, \ldots, \mathbf{q}_m, v_1, \ldots, v_m$,

Minimize: $c_1^2 v_1 + \cdots + c_m^2 v_m$,

Subject to: $c_1 \mathbf{q}_1 + \cdots + c_m \mathbf{q}_m = \mathbf{q}$.

Once the quadratic program is solved, one can compare $c_1^2 v_1 + \cdots + c_m^2 v_m$ with $v$. By
Proposition 9, $S \to (\mathbf{q}, v)$ if and only if $c_1^2 v_1 + \cdots + c_m^2 v_m \le v$. Since the quadratic program
above has $m$ variables and the constraints are a linear equation on $n$-dimensional vectors, it can be
solved in PTIME(m, n) [5]. Thus the verification process can be done in PTIME(m, n) as well. □
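As a worked illustration that is not part of the original proof, the quadratic program has a closed-form solution under two extra assumptions the text does not make: every $v_i > 0$, and the $n \times m$ matrix $A = [\mathbf{q}_1 \cdots \mathbf{q}_m]$ has full row rank, so that the constraint $A\mathbf{c} = \mathbf{q}$ is feasible. Writing $V = \mathrm{diag}(v_1, \ldots, v_m)$, the objective is $\mathbf{c}^{\top} V \mathbf{c}$, which is strictly convex, and the method of Lagrange multipliers gives:

```latex
\begin{aligned}
\mathcal{L}(\mathbf{c}, \boldsymbol{\lambda})
  &= \mathbf{c}^{\top} V \mathbf{c} - \boldsymbol{\lambda}^{\top} (A\mathbf{c} - \mathbf{q}), \\
\nabla_{\mathbf{c}} \mathcal{L} = 0
  \;&\Rightarrow\; \mathbf{c} = \tfrac{1}{2} V^{-1} A^{\top} \boldsymbol{\lambda}, \\
A\mathbf{c} = \mathbf{q}
  \;&\Rightarrow\; \boldsymbol{\lambda} = 2\, (A V^{-1} A^{\top})^{-1} \mathbf{q}, \\
\mathbf{c}^{*}
  &= V^{-1} A^{\top} (A V^{-1} A^{\top})^{-1} \mathbf{q}, \\
\min_{\mathbf{c}} \sum_{i=1}^{m} c_i^2 v_i
  &= \mathbf{q}^{\top} (A V^{-1} A^{\top})^{-1} \mathbf{q}.
\end{aligned}
```

For instance, with $m$ copies of the same one-dimensional query $(q, v)$ we get $A V^{-1} A^{\top} = m q^2 / v$, so the minimum variance is $v/m$, which matches the averaging argument. When the assumptions above fail, feasibility has to be checked first, as the text describes.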

3.3 Arbitrage-free Price Functions: Definition 

Arbitrage is possible when the answer to any query $Q$ can be obtained more cheaply than the
advertised price $\pi(Q)$ from an alternative set of priced queries. When arbitrage is possible it
complicates the interface between the buyer and market maker: the buyer may need to reason
carefully about his queries to achieve the lowest price, while at the same time the market maker
may not achieve the revenue intended by some of his advertised prices.

Definition 11 (Arbitrage-free). A price function $\pi(Q)$ is arbitrage-free if $\{Q_1, \ldots, Q_m\} \to Q$
implies:

$$\pi(Q) \le \sum_{i=1}^{m} \pi(Q_i).$$

Example 12. Consider a query $(\mathbf{q}, v)$ offered for price $\pi(\mathbf{q}, v)$. A buyer who wishes to improve the
accuracy of the query may ask the same query $n$ times, $(\mathbf{q}, v), (\mathbf{q}, v), \ldots, (\mathbf{q}, v)$, at a total cost of
$n \cdot \pi(\mathbf{q}, v)$. The buyer then computes the average of the query answers to get an estimated answer
with a much lower variance, namely $v/n$. The price function must ensure that the total payment
collected from the buyer covers the cost of this lower variance, in other words $n \cdot \pi(\mathbf{q}, v) \ge \pi(\mathbf{q}, v/n)$.
If it is arbitrage-free, then it is easy to check that this condition holds. Indeed, $\{(\mathbf{q}, v), \ldots, (\mathbf{q}, v)\} \to
(n\mathbf{q}, nv) \to (\mathbf{q}, v/n)$, and arbitrage-freeness implies $\pi(\mathbf{q}, v/n) \le \pi(\mathbf{q}, v) + \cdots + \pi(\mathbf{q}, v) = n \cdot \pi(\mathbf{q}, v)$.
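The condition of Example 12 can be checked on a concrete price function. The sketch below uses the L2-based price (squared L2 norm of q divided by v), which the paper shows later in this section to be arbitrage-free; the function name and the numbers are hypothetical.

```python
def l2_price(q, v):
    """Price ||q||_2^2 / v; assumes v > 0."""
    return sum(qi * qi for qi in q) / v


q, v, n = [1.0, 2.0], 8.0, 4
repeated = n * l2_price(q, v)  # buy (q, v) n times and average the answers
direct = l2_price(q, v / n)    # buy (q, v/n) once

print(repeated, direct)        # 2.5 2.5
print(repeated >= direct)      # True: averaging gives no discount
```

For this price function the two costs coincide, so it sits exactly on the boundary of what Example 12 permits.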

We prove that any arbitrage-free pricing function satisfies some simple properties:

Proposition 13. Let $\pi$ be an arbitrage-free pricing function. Then:

(1) The zero query is free: $\pi(\mathbf{0}, v) = 0$.

(2) Higher variance is cheaper: $v \le v'$ implies $\pi(\mathbf{q}, v) \ge \pi(\mathbf{q}, v')$.

(3) The zero-variance query is the most expensive:¹ $\pi(\mathbf{q}, 0) \ge \pi(\mathbf{q}, v)$ for all $v \ge 0$.

(4) Infinite noise is free: if $\pi$ is a continuous function, then $\pi(\mathbf{q}, \infty) = 0$.

Proof. For (1), we have $\emptyset \to (\mathbf{0}, 0)$ by the first rule of Def. 8 (taking $k = 0$, i.e. $S = \emptyset$) and
$(\mathbf{0}, 0) \to (\mathbf{0}, v)$ by the third rule; since the empty sum of prices is 0, $\pi(\mathbf{0}, v) = 0$. (2) follows from
$(\mathbf{q}, v) \to (\mathbf{q}, v')$ when $v \le v'$. (3) follows immediately from (2), since all variances are $v \ge 0$. For (4),
we use the second rule to derive $(\tfrac{1}{c} \cdot \mathbf{q}, v) \to (\mathbf{q}, c^2 \cdot v)$, hence
$\pi(\mathbf{q}, \infty) = \lim_{c \to \infty} \pi(\mathbf{q}, c^2 \cdot v) \le \lim_{c \to \infty} \pi(\tfrac{1}{c} \cdot \mathbf{q}, v) = \pi(\mathbf{0}, v) = 0$. □

Arbitrage-free price functions have been studied before [20, 22], but only in the context of
deterministic (i.e. unperturbed) query answers. Our definition extends those in [20, 22] to queries
with perturbed answers.

3.4 Arbitrage-free Price Functions: Synthesis 

Clearly, a market maker wants to choose a price function that is arbitrage-free. Here we address
the question of how to design such functions. Obviously, the trivial pricing function $\pi(Q) = 0$, for
all $Q$, under which every query is free, is arbitrage-free, but we want to design non-trivial pricing
functions. For example, it would be a mistake for the market maker to charge a constant price
$c > 0$ for each query, i.e. $\pi(Q) = c$ for all $Q$, because such a pricing function leads to arbitrage
(this follows from Prop. 13: the zero query must be free).



We start by analyzing how the price function $\pi(\mathbf{q}, v)$ depends on the variance $v$. By (2) of
Prop. 13 we know that it is monotonically decreasing in $v$, and by (4) it cannot be independent of
$v$ (unless $\pi$ is trivial). The next proposition shows that it cannot decrease faster than $1/v$:

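Proposition 14. For any arbitrage-free price function $\pi$ and any linear query $\mathbf{q}$, $\pi(\mathbf{q}, v) = \Omega(1/v)$.

Proof. Suppose the contrary: there exists a linear query $\mathbf{q}$ and a sequence $\{v_i\}_{i=1}^{\infty}$ such that
$\lim_{i \to \infty} v_i = +\infty$ and $\lim_{i \to \infty} v_i \pi(\mathbf{q}, v_i) = 0$. Select $i_0$ such that $v_{i_0} > 1$ and $v_{i_0} \pi(\mathbf{q}, v_{i_0}) < \pi(\mathbf{q}, 1)/2$.
Then, we can answer the query $(\mathbf{q}, 1)$ by asking the query $(\mathbf{q}, v_{i_0})$ at most $\lceil v_{i_0} \rceil$ times and computing the
average of the answers. The price we pay for these $\lceil v_{i_0} \rceil$ queries is

$$\lceil v_{i_0} \rceil \, \pi(\mathbf{q}, v_{i_0}) \le (v_{i_0} + 1) \, \pi(\mathbf{q}, v_{i_0}) \le 2 v_{i_0} \, \pi(\mathbf{q}, v_{i_0}) < \pi(\mathbf{q}, 1),$$

which implies that we have arbitrage, which is a contradiction. □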
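The proof idea can be seen on a concrete counterexample. The sketch below compares a hypothetical price that decays like 1/v² (faster than Prop. 14 allows) with one that decays like 1/v; both price functions and all names are made up for illustration.

```python
import math


def too_steep_price(q, v):
    """Hypothetical price decaying like 1/v^2, faster than Prop. 14 allows."""
    return sum(qi * qi for qi in q) / (v * v)


def inverse_price(q, v):
    """Hypothetical price decaying like 1/v."""
    return sum(qi * qi for qi in q) / v


def averaging_cost(price, q, v0, target_v):
    """Cost of reaching variance target_v by averaging copies of the query (q, v0)."""
    copies = math.ceil(v0 / target_v)
    return copies * price(q, v0)


q = [1.0]
# Ten copies at variance 10 cost about 0.1, far below the direct price 1.0.
print(averaging_cost(too_steep_price, q, 10.0, 1.0), too_steep_price(q, 1.0))
# With a 1/v price, averaging costs as much as buying directly.
print(averaging_cost(inverse_price, q, 10.0, 1.0), inverse_price(q, 1.0))
```

The first line exhibits the arbitrage used in the proof; the second shows that a 1/v decay leaves no such gap.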



1 It is possible that $\pi(\mathbf{q}, 0) = \infty$.

Our next step is to understand the dependency on $\mathbf{q}$, and for that we will assume that $\pi$ is
inversely proportional to $v$, in other words that it decreases at a rate $1/v$, which is the fastest rate
allowed by the previous proposition. Set $\pi(\mathbf{q}, v) = f^2(\mathbf{q})/v$, for some positive function $f$ that
depends only on $\mathbf{q}$. We prove that $\pi$ is arbitrage-free iff $f$ is a semi-norm. Recall that a semi-norm
is a function $f : \mathbb{R}^n \to \mathbb{R}$ that satisfies the following properties:²

• For any $c \in \mathbb{R}$ and any $\mathbf{q} \in \mathbb{R}^n$, $f(c\mathbf{q}) = |c| f(\mathbf{q})$.

• For any $\mathbf{q}_1, \mathbf{q}_2 \in \mathbb{R}^n$, $f(\mathbf{q}_1 + \mathbf{q}_2) \le f(\mathbf{q}_1) + f(\mathbf{q}_2)$.

We prove: 

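Theorem 15. Let $\pi(\mathbf{q}, v)$ be a price function s.t.³ $\pi(\mathbf{q}, v) = f^2(\mathbf{q})/v$ for some function $f$. Then
$\pi(\mathbf{q}, v)$ is arbitrage-free iff $f(\mathbf{q})$ is a semi-norm.

Proof. ($\Rightarrow$): Assuming $\pi$ is arbitrage-free, we prove that $f$ is a semi-norm. For $c \ne 0$, by the second
rule of Def. 8 we have both:

$$(\mathbf{q}, v) \to (c\mathbf{q}, c^2 v)$$

$$(c\mathbf{q}, c^2 v) \to \left(\tfrac{1}{c} \cdot c\mathbf{q}, \left(\tfrac{1}{c}\right)^2 \cdot c^2 v\right) = (\mathbf{q}, v)$$

Therefore both $\pi(\mathbf{q}, v) \le \pi(c\mathbf{q}, c^2 v)$ and $\pi(\mathbf{q}, v) \ge \pi(c\mathbf{q}, c^2 v)$ hold, thus $\pi(\mathbf{q}, v) = \pi(c\mathbf{q}, c^2 v)$. This
implies that, if $c \ne 0$,

$$f(c\mathbf{q}) = \sqrt{\pi(c\mathbf{q}, c^2 v) \, c^2 v} = |c| \sqrt{\pi(\mathbf{q}, v) \, v} = |c| f(\mathbf{q}).$$

If $c = 0$, we also have $f(c\mathbf{q}) = \sqrt{\pi(c\mathbf{q}, c^2 v) \, c^2 v} = 0 = |c| f(\mathbf{q})$.

Next we prove that $f(\mathbf{q}_1 + \mathbf{q}_2) \le f(\mathbf{q}_1) + f(\mathbf{q}_2)$. Set the variances $v_1 = f(\mathbf{q}_1)$ and $v_2 =
f(\mathbf{q}_2)$; then we have $f(\mathbf{q}_1) = \pi(\mathbf{q}_1, v_1)$ and $f(\mathbf{q}_2) = \pi(\mathbf{q}_2, v_2)$. By the first rule in Def. 8 we have
$\{(\mathbf{q}_1, v_1), (\mathbf{q}_2, v_2)\} \to (\mathbf{q}_1 + \mathbf{q}_2, v_1 + v_2)$, and therefore:

$$\frac{f^2(\mathbf{q}_1 + \mathbf{q}_2)}{f(\mathbf{q}_1) + f(\mathbf{q}_2)} = \pi(\mathbf{q}_1 + \mathbf{q}_2, v_1 + v_2) \le \pi(\mathbf{q}_1, v_1) + \pi(\mathbf{q}_2, v_2) = f(\mathbf{q}_1) + f(\mathbf{q}_2)$$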

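which proves the claim.

($\Leftarrow$): Suppose $\pi(\mathbf{q}, v) = f^2(\mathbf{q})/v$ and $f(\mathbf{q})$ is a semi-norm. According to Proposition 9,
$\{(\mathbf{q}_1, v_1), \ldots, (\mathbf{q}_m, v_m)\} \to (\mathbf{q}, v)$ if and only if there exist $c_1, \ldots, c_m$ such that $c_1 \mathbf{q}_1 + \cdots + c_m \mathbf{q}_m = \mathbf{q}$
and $c_1^2 v_1 + \cdots + c_m^2 v_m \le v$. Then,

$$\sum_{i=1}^{m} \pi(\mathbf{q}_i, v_i) = \sum_{i=1}^{m} \frac{f^2(\mathbf{q}_i)}{v_i} = \frac{\left(\sum_{i=1}^{m} \frac{f^2(\mathbf{q}_i)}{v_i}\right)\left(\sum_{i=1}^{m} c_i^2 v_i\right)}{\sum_{i=1}^{m} c_i^2 v_i}$$

$$\ge \frac{\left(\sum_{i=1}^{m} |c_i| f(\mathbf{q}_i)\right)^2}{\sum_{i=1}^{m} c_i^2 v_i} = \frac{\left(\sum_{i=1}^{m} f(c_i \mathbf{q}_i)\right)^2}{\sum_{i=1}^{m} c_i^2 v_i}$$

$$\ge \frac{f^2(\mathbf{q})}{v} = \pi(\mathbf{q}, v),$$

where the first inequality follows from the Cauchy-Schwarz inequality and the second comes from the
sub-additivity of the semi-norm together with the bound $c_1^2 v_1 + \cdots + c_m^2 v_m \le v$. □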



2 Taking $c = 0$ in the first property implies $f(\mathbf{0}) = 0$; if the converse also holds, i.e. $f(\mathbf{q}) = 0$ implies $\mathbf{q} = \mathbf{0}$, then
$f$ is called a norm. Also, recall that any semi-norm satisfies $f(\mathbf{q}) \ge 0$, by the triangle inequality.

3 In other words, $f(\mathbf{q}) = \sqrt{\pi(\mathbf{q}, v) \, v}$ is independent of $v$.

As an immediate application of the theorem, let's instantiate $f$ to be one of the norms $L_2$, $L_\infty$, $L_p$,
or a weighted norm $L_2$. This implies that the following four functions are arbitrage-free:

$$\pi(\mathbf{q}, v) = \|\mathbf{q}\|_2^2 / v = \sum_i q_i^2 / v \quad (1)$$

$$\pi(\mathbf{q}, v) = \|\mathbf{q}\|_\infty^2 / v = \max_i q_i^2 / v \quad (2)$$

$$\pi(\mathbf{q}, v) = \|\mathbf{q}\|_p^2 / v = \Big(\sum_i |q_i|^p\Big)^{2/p} / v, \quad p \ge 1 \quad (3)$$

$$\pi(\mathbf{q}, v) = \Big(\sum_i w_i \cdot q_i^2\Big) / v, \quad w_1, \ldots, w_n \ge 0 \quad (4)$$

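However, these are not the only arbitrage-free pricing functions: the proposition below gives us
a general method for synthesizing new arbitrage-free pricing functions from existing ones. Recall
that a function $f : (\mathbb{R}^+)^k \to \mathbb{R}^+$ is called subadditive if for any two vectors $\mathbf{x}, \mathbf{y} \in (\mathbb{R}^+)^k$, $f(\mathbf{x} + \mathbf{y}) \le f(\mathbf{x}) + f(\mathbf{y})$;
the function is called non-decreasing if $\mathbf{x} \le \mathbf{y}$ implies $f(\mathbf{x}) \le f(\mathbf{y})$.

Proposition 16. Let $f : (\mathbb{R}^+)^k \to \mathbb{R}^+$ be a subadditive, non-decreasing function. Then, for any
arbitrage-free price functions $\pi_1, \ldots, \pi_k$, the function $\pi(Q) = f(\pi_1(Q), \ldots, \pi_k(Q))$ is also
arbitrage-free.

Proof. For any query $Q$, denote $\bar{\pi}(Q) = (\pi_1(Q), \ldots, \pi_k(Q))$. Assume $\{(\mathbf{q}_1, v_1), \ldots, (\mathbf{q}_m, v_m)\} \to
(\mathbf{q}, v)$, and write $Q = (\mathbf{q}, v)$ and $Q_i = (\mathbf{q}_i, v_i)$. We have:

$$\bar{\pi}(Q) \le \sum_i \bar{\pi}(Q_i) \quad \text{because each } \pi_j \text{ is arbitrage-free}$$

$$f(\bar{\pi}(Q)) \le f\Big(\sum_i \bar{\pi}(Q_i)\Big) \quad \text{because } f \text{ is non-decreasing}$$

$$\le \sum_i f(\bar{\pi}(Q_i)) \quad \text{because } f \text{ is sub-additive}$$

□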



Proposition 16 allows us to synthesize new arbitrage-free price functions from existing arbitrage-free
price functions. Below we include some operations that satisfy the requirements in
Proposition 16.

Corollary 17. If $\pi_1, \ldots, \pi_k$ are arbitrage-free price functions, then so are the following functions:

• Addition: $\pi_1 + \ldots + \pi_k$;

• Maximum: $\max(\pi_1, \ldots, \pi_k)$;

• Cut-off: $\min(\pi_1, c)$, where $c$ is a constant;

• Power: $\pi_1^c$ where $0 < c \le 1$;

• Logarithmic: $\log(\pi_1 + 1)$;

• Geometric mean: $\sqrt{\pi_1 \cdot \pi_2}$.

Proof. It is clear that all the functions above are monotonically increasing. One can check directly 
that maximum and cut-off functions are sub-additive. Sub-additivity for the rest follows from the 
following: 
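Lemma 18. Let $f : (\mathbb{R}^+)^k \to \mathbb{R}^+$ be a non-decreasing function s.t. $f(\mathbf{0}) = 0$ and all second
derivatives are continuous. Then, if $\partial^2 f / \partial x_i \partial x_j \le 0$ for all $i, j = 1, \ldots, k$, then $f$ is sub-additive.

Proof. Denote $f_i = \partial f / \partial x_i$ and $f_{ij} = \partial^2 f / \partial x_i \partial x_j$. We apply twice the first-order Taylor
approximation $f(\mathbf{x}) - f(\mathbf{0}) = \sum_i (\partial f / \partial x_i)(\boldsymbol{\xi}) \cdot x_i$, once to $g(\mathbf{y}) = f(\mathbf{x} + \mathbf{y}) - f(\mathbf{y})$, and the second time to
$f_i(\mathbf{x} + \boldsymbol{\xi}) - f_i(\boldsymbol{\xi})$, where $\boldsymbol{\xi}$ and $\boldsymbol{\eta}$ denote intermediate points:

$$f(\mathbf{x}) + f(\mathbf{y}) - f(\mathbf{x} + \mathbf{y}) = [f(\mathbf{x}) - f(\mathbf{0})] - [f(\mathbf{x} + \mathbf{y}) - f(\mathbf{y})]$$

$$= g(\mathbf{0}) - g(\mathbf{y}) = -\sum_i \frac{\partial g}{\partial y_i}(\boldsymbol{\xi}) \cdot y_i$$

$$= -\sum_i \big(f_i(\mathbf{x} + \boldsymbol{\xi}) - f_i(\boldsymbol{\xi})\big) \cdot y_i = -\sum_{i,j} f_{ij}(\boldsymbol{\xi} + \boldsymbol{\eta}) \cdot x_j \cdot y_i \ge 0$$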

□ 
□ 
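
As a quick sanity check on Lemma 18, the sketch below samples non-negative pairs and records the largest observed value of $f(x + y) - f(x) - f(y)$ for a few one-argument transforms with $f(0) = 0$ and non-positive second derivative. The dictionary of transforms, the sampling range and the helper name `worst_gap` are assumptions chosen for illustration; the check covers only single-argument functions.

```python
import math
import random

# One-argument transforms with f(0) = 0, f' >= 0 and f'' <= 0 on [0, inf).
TRANSFORMS = {
    "power (c = 0.5)": lambda t: t ** 0.5,
    "logarithmic": lambda t: math.log(t + 1.0),
    "atan": math.atan,
    "tanh": math.tanh,
}

def worst_gap(f, rng, trials=5000, hi=20.0):
    """Largest sampled value of f(x + y) - f(x) - f(y) for x, y >= 0."""
    worst = -math.inf
    for _ in range(trials):
        x, y = rng.uniform(0.0, hi), rng.uniform(0.0, hi)
        worst = max(worst, f(x + y) - f(x) - f(y))
    return worst

rng = random.Random(0)
gaps = {name: worst_gap(f, rng) for name, f in TRANSFORMS.items()}
```

Sub-additivity corresponds to every entry of `gaps` being non-positive.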

Example 19. For a simple illustration we will prove that the pricing function $\pi(\mathbf{q}, v) = \max_i |q_i| / \sqrt{v}$
is arbitrage-free. Start from $\pi_1(\mathbf{q}, v) = \max_i q_i^2 / v$, which is arbitrage-free by Eq. 2, then notice that
$\pi = (\pi_1)^{1/2}$, hence it is arbitrage-free by Corollary 17.
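
The composition rules can be expressed as higher-order functions. The following is a minimal sketch, assuming price functions are represented as Python callables `pi(q, v)`; the helper names `power`, `cutoff` and `add` are hypothetical and only mirror three of the operations listed in Corollary 17.

```python
def pi_l2(q, v):
    """Eq. 1: sum_i q_i^2 / v."""
    return sum(x * x for x in q) / v

def pi_linf(q, v):
    """Eq. 2: max_i q_i^2 / v."""
    return max(x * x for x in q) / v

def power(pi, c):
    """Power rule: pi^c, with 0 < c <= 1."""
    if not 0 < c <= 1:
        raise ValueError("exponent must lie in (0, 1]")
    return lambda q, v: pi(q, v) ** c

def cutoff(pi, cap):
    """Cut-off rule: min(pi, cap)."""
    return lambda q, v: min(pi(q, v), cap)

def add(*pis):
    """Addition rule: pi_1 + ... + pi_k."""
    return lambda q, v: sum(p(q, v) for p in pis)

# The price function of Example 19: max_i |q_i| / sqrt(v).
example19 = power(pi_linf, 0.5)

# A capped combination built from the rules above.
capped = cutoff(add(pi_l2, example19), 100.0)
```

For instance, `example19([3.0, -4.0], 4.0)` evaluates $\sqrt{16/4} = 2$, and `capped` returns the cap once the variance becomes small enough.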



3.5 Selling the True Private Data 

In some scenarios, the data owner is willing to sell her unperturbed private data, at some high price.
This is not possible in traditional differential privacy, because any query answering mechanism that
releases the true, unperturbed answer to some query cannot satisfy $\varepsilon$-differential privacy for any
$\varepsilon$. In other words, true answers require an infinite privacy budget; similarly, our simple pricing
functions in Eq. 1 to Eq. 4 also set an infinite price for any query with a zero variance. However, in
a data market in which users are compensated for their private information, there may be a finite
price for which owners are willing to release their true data to buyers and we therefore seek price
functions that have a finite maximum price. For that we need to construct a pricing function that
allows the variance $v = 0$; in other words, it returns a finite price for queries of the form $\pi(\mathbf{q}, 0)$.

A simple approach is to use the cut-off function in Corollary 17 to convert any arbitrage-free
price function into a bounded arbitrage-free price function. Such a function sets an accuracy
threshold above which the query price is unchanged. More sophisticated functions are possible by
using sigmoid curves, often used as learning curves by the machine learning community. Many
of those curves are concave and monotonically increasing over $\mathbb{R}^+$, and are therefore, by Lemma 18,
subadditive on $\mathbb{R}^+$ when $f(0) = 0$. Thus, we can apply those learning curves that are
centered at 0 in Proposition 16 so as to generate smooth arbitrage-free price functions with a finite
maximum. Below we describe some candidate functions, and include the proof in the appendix.

Corollary 20. Given an arbitrage-free price function $\pi$, each of the following functions is also
arbitrage-free, and bounded: $\mathrm{atan}(\pi)$, $\tanh(\pi)$, $\pi / \sqrt{\pi^2 + 1}$.

For example, the pricing function$^4$ $\pi(\mathbf{q}, v) = \frac{20}{\Pi} \cdot \mathrm{atan}\big(\sum_i q_i^2 / v\big)$ is arbitrage-free and sets the
price of the true private data to \$10; any perturbed query costs less than \$10.
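
A small sketch of such a bounded price function follows. It assumes the arctangent form with the constant chosen so that the price tends to `max_price` as $v \to 0$; the function name, the `max_price` parameter, and the convention that a zero query costs nothing are assumptions made for this example.

```python
import math

def bounded_price(q, v, max_price=10.0):
    """(2 * max_price / pi) * atan(sum_i q_i^2 / v).

    v = 0 stands for the unperturbed answer; the price then equals max_price.
    """
    energy = sum(x * x for x in q)
    if energy == 0:
        return 0.0  # assumed convention: the all-zero query is free
    ratio = math.inf if v == 0 else energy / v
    return (2.0 * max_price / math.pi) * math.atan(ratio)
```

With this form the price increases as the requested variance decreases, and `bounded_price(q, 0.0)` returns the cap (up to rounding) for any non-zero `q`.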



4 We denote by $\Pi$ the constant "pi", in order to avoid confusion with the pricing function $\pi$.
4 Privacy Loss



In this section we describe the second component of the pricing framework in Fig. 1: the privacy
loss $\varepsilon_i$. Recall that, for each buyer's query $Q = (\mathbf{q}, v)$, the market maker defines a random function
$\mathcal{K}_Q$, such that, for any database instance $\mathbf{x}$, the random variable $\mathcal{K}_Q(\mathbf{x})$ has expectation $\mathbf{q}(\mathbf{x})$ and
variance less than or equal to $v$. By answering the query through this mechanism, the market
maker leaks some information about each data item $x_i$, and its owner expects to be compensated
appropriately. In this section we define formally the privacy loss, and establish a few properties; in
the next section we will relate the privacy loss to the micro-payment that the owner expects.

Our definition of privacy loss is adapted from differential privacy, which compares the output of
a mechanism with and without the contribution of the data item $x_i$. For that, we need to impose
a bound on the possible values of $x_i$. We fix a bounded domain of values $\mathcal{X} \subseteq \mathbb{R}$, and assume that
each data item $x_i$ is in $\mathcal{X}$. For example, in the case of binary data values $\mathcal{X} = \{0, 1\}$ (0 = owner does
not have the feature, 1 = she does have the feature), or in the case of ages, $\mathcal{X} = [0, 150]$, etc.

Given the database instance $\mathbf{x}$, denote by $\mathbf{x}^{(i)}$ the database instance obtained by setting $x_i = 0$,
and leaving all other values unchanged. That is, $\mathbf{x}^{(i)}$ represents the database without the user $i$.

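Definition 21. Let $\mathcal{K}$ be any mechanism (meaning: for any database instance $\mathbf{x}$, $\mathcal{K}(\mathbf{x})$ is a random
variable). The privacy loss to user $i$, in notation $\varepsilon_i(\mathcal{K}) \in \bar{\mathbb{R}}^+$, is defined as:

$$\varepsilon_i(\mathcal{K}) = \sup_{S, \mathbf{x}} \log \frac{\Pr(\mathcal{K}(\mathbf{x}) \in S)}{\Pr(\mathcal{K}(\mathbf{x}^{(i)}) \in S)}$$

where $\mathbf{x}$ ranges over $\mathcal{X}^n$ and $S$ ranges over measurable sets of $\mathbb{R}$.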

We explain the connection to differential privacy in the next section. For now, we derive some
simple properties of the privacy loss function. The following are well known:

Proposition 22. (1) Suppose $\mathcal{K}$ is a deterministic mechanism. Then $\varepsilon_i(\mathcal{K}) = 0$ when $\mathcal{K}$ is
independent of the input $x_i$, and $\varepsilon_i(\mathcal{K}) = \infty$ otherwise. (2) Let $\mathcal{K}_1, \ldots, \mathcal{K}_m$ be mechanisms with privacy
losses $\varepsilon_1, \ldots, \varepsilon_m$. Let $\mathcal{K} = c_1 \cdot \mathcal{K}_1 + \ldots + c_m \cdot \mathcal{K}_m$ be a linear combination. Then its privacy loss is
$\varepsilon(\mathcal{K}) = |c_1| \cdot \varepsilon_1 + \ldots + |c_m| \cdot \varepsilon_m$.



In this paper we restrict the mechanism to be data-independent.

Definition 23. A query-answering mechanism $\mathcal{K}$ is called data-independent if, for any query
$Q = (\mathbf{q}, v)$, $\mathcal{K}_Q(\mathbf{x}) = \mathbf{q}(\mathbf{x}) + \rho(v)$, where $\rho(v)$ is a random function.

In other words, a data-independent mechanism for answering $Q = (\mathbf{q}, v)$ will first compute the
true query answer $\mathbf{q}(\mathbf{x})$, then add a noise $\rho(v)$ that depends only on the buyer's specified variance,
and is independent of the database instance. We prove:

Proposition 24. Let $\mathcal{K}$ be any data-independent mechanism. If the query $Q = (\mathbf{q}, v)$ has the $i$-th
component equal to zero, $q_i = 0$, then $\varepsilon_i(\mathcal{K}_Q) = 0$. In other words, users who don't contribute to a
query's answer suffer no privacy loss.

Proof. The two random variables $\mathcal{K}_Q(\mathbf{x})$ and $\mathcal{K}_Q(\mathbf{x}^{(i)})$ are equal, because $\mathcal{K}_Q(\mathbf{x}) = \mathbf{q}(\mathbf{x}) + \rho(v) =
\mathbf{q}(\mathbf{x}^{(i)}) + \rho(v) = \mathcal{K}_Q(\mathbf{x}^{(i)})$, which proves the claim. □

In contrast, a data-dependent mechanism might compute the noise as a function of all data
items $\mathbf{x}$, and may result in a privacy loss for the data item $x_i$ even when $q_i = 0$. For that reason
we only consider data-independent mechanisms in this paper.
The privacy loss given by Def. 21 is difficult to compute in general. Instead, we will follow the
techniques developed for differential privacy, and give an upper bound based on query sensitivity.
Let $\gamma = \sup_{x \in \mathcal{X}} |x|$.

Definition 25 (Personalized Sensitivity). The sensitivity of a query $\mathbf{q}$ at data item $x_i$ is defined
as

$$s_i = \sup_{\mathbf{x} \in \mathcal{X}^n} |\mathbf{q}(\mathbf{x}) - \mathbf{q}(\mathbf{x}^{(i)})| = \gamma \cdot |q_i|.$$

We let $\mathrm{Lap}(b)$ denote the one-dimensional Laplacian distribution centered at 0 with scale $b$ and
the corresponding probability density function $g(x) = \frac{1}{2b} e^{-|x|/b}$.

Definition 26. The Laplacian Mechanism, denoted $\mathcal{L}$, is the data-independent mechanism defined
as follows: for a given query $Q = (\mathbf{q}, v)$ and database instance $\mathbf{x}$, the mechanism returns $\mathcal{L}_Q(\mathbf{x}) =
\mathbf{q}(\mathbf{x}) + \rho$, where $\rho$ is a noise with distribution $\mathrm{Lap}(b)$ and $b = \sqrt{v/2}$.

The following is known from the work on differential privacy [12].

Proposition 27. Let $\mathcal{L}$ be the Laplacian mechanism and $Q = (\mathbf{q}, v)$ be a query. Then, the privacy
loss of individual $i$ is bounded by:

$$\varepsilon_i(\mathcal{L}_Q) \le \frac{\gamma}{\sqrt{v/2}} \cdot |q_i|.$$
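
To make the mechanism concrete, here is a minimal sketch of a Laplacian mechanism for a linear query, together with the per-owner bound $\gamma \cdot |q_i| / \sqrt{v/2}$ that the later proofs use. The function names are hypothetical, and sampling the Laplace noise as a difference of two exponential draws is a choice made here, not something prescribed by the text.

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw from Lap(scale) as the difference of two exponential variables."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(q, x, v, rng):
    """Return q(x) + noise, where the noise has variance v (scale b = sqrt(v/2))."""
    b = math.sqrt(v / 2.0)
    true_answer = sum(qi * xi for qi, xi in zip(q, x))
    return true_answer + laplace_noise(b, rng)

def privacy_loss_bound(q, v, gamma):
    """Per-owner upper bound gamma * |q_i| / sqrt(v/2) on the privacy loss."""
    b = math.sqrt(v / 2.0)
    return [gamma * abs(qi) / b for qi in q]
```

The variance of $\mathrm{Lap}(b)$ is $2b^2$, so $b = \sqrt{v/2}$ yields exactly the variance $v$ requested by the buyer, and an owner with $q_i = 0$ gets a bound of zero, in line with Proposition 24.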




5 Micro-Payments to Data Owners 



In this section we describe the third component of the pricing framework in Fig. 1: the
micro-payments $\mu_i$.

By answering a buyer's query using some mechanism $\mathcal{K}_Q$, the market maker leaks some of
the private data of the data owners. The market maker must compensate these data owners, in
proportion to the degree of their privacy loss, using money collected from the buyer's payment
$\pi(Q)$. Recall that we have $n$ data owners (not necessarily distinct), each contributing data item $x_i$.
In response to a query $Q$, the owner of $x_i$ receives a micro-payment $\mu_i(Q)$. The micro-payments $\mu_i$
close the loop in Fig. 1, and must be defined in a way that makes them consistent with the payment
function $\pi$ and the privacy loss $\varepsilon_i$ caused by the query answering mechanism. In this section we
will make this connection precise. Before that, we require the micro-payment functions to satisfy
two simple properties:

Definition 28. Let $\mu_i$ be a micro-payment function. We define the following two properties:

Fairness: For each $i$, if $q_i = 0$, then $\mu_i(\mathbf{q}, v) = 0$.

Micro arbitrage-free: For each $i$, $\mu_i(Q)$ is an arbitrage-free pricing function.

Fairness is self-explanatory: data owners whose data is not queried should not expect payment.
Arbitrage-freeness is a promise that the owner's loss of privacy will be compensated, and that there
is no way for the buyer to circumvent the due micro-payment by asking other queries and combining
their answers. This is distinct from the guarantee of arbitrage-freeness in Sect. 3, which refers to
the overall price $\pi(Q)$, and must be verified for each user.
6 Balanced Pricing Frameworks



Finally, we discuss the interaction between the three components in Fig. 1, the query price $\pi$,
the privacy loss $\varepsilon_i$, and the micro-payments $\mu_i$, and define formally when a pricing framework is
balanced. Then, we give a general procedure for designing a balanced pricing framework.

6.1 Balanced Pricing Frameworks: Definition 

The contract between the data owner of item $x_i$ and the market-maker consists of a non-decreasing
function $W_i : \bar{\mathbb{R}}^+ \to \bar{\mathbb{R}}^+$, s.t. $W_i(0) = 0$. This function represents a guarantee to the data owner
that she will be compensated with at least $\mu_i \ge W_i(\varepsilon_i)$ in the event of a privacy loss $\varepsilon_i$. We denote by
$\mathbf{W} = (W_1, \ldots, W_n)$ the contract between the market-maker and all data owners.

The connection between the micro-payments $\mu_i$, the query price $\pi$ and the privacy loss $\varepsilon_i$ is
captured by the following definition.

Definition 29. We say that the micro-payment functions $\mu_i$, $i = 1, \ldots, n$, are cost-recovering for
a pricing function $\pi$ if, for any query $Q$, $\pi(Q) \ge \sum_i \mu_i(Q)$.

Fix a query answering mechanism $\mathcal{K}$. We say that a micro-payment function $\mu_i$ is compensating
for a contract function $W_i$, if for any query $Q$, $\mu_i(Q) \ge W_i(\varepsilon_i(\mathcal{K}_Q))$.

The market maker will insist that the micro-payment functions are cost-recovering: otherwise,
he will not be able to pay the data owners from the buyer's payment.

A data owner will insist that the micro-payment function is compensating: this enforces the
contract between her and the market-maker, guaranteeing that she will be compensated at least
$W_i(\varepsilon_i)$, in the event of a privacy loss $\varepsilon_i$.



Finally, we say that a pricing framework (Fig. 1) is balanced if its components satisfy all the
desirable properties we discussed in our paper. A pricing framework consists of (1) the pricing
function $\pi$, (2) the query answering mechanism $\mathcal{K}$; instead of the mechanism $\mathcal{K}$, we will indicate
the privacy loss function, $\varepsilon_i(\mathcal{K}_Q)$, (3) the micro-payments $\mu_i$, and (4) the contract functions $W_i$.
Thus, we denote the pricing framework by $(\pi, \varepsilon, \mu, \mathbf{W})$.

Definition 30. A pricing framework $(\pi, \varepsilon, \mu, \mathbf{W})$ is balanced if (1) $\pi$ is arbitrage-free and (2) the
micro-payment functions $\mu_i$ are fair, micro arbitrage-free, cost-recovering for $\pi$, and compensating
for $\mathbf{W}$.

In the rest of this section we will give a general procedure for synthesizing balanced pricing 
frameworks. Before that, we discuss the distinction between our framework and differential privacy. 

Discussion. The contract between the data owner and the market maker differs from the
contract in privacy-preserving mechanisms. Let $\varepsilon > 0$ be a small constant. A mechanism $\mathcal{K}$ is called
differentially private if, for any user $i$ and for any measurable set $S$, and any database instance
$\mathbf{x}$:

$$\mathrm{Prob}(\mathcal{K}(\mathbf{x}) \in S) \le e^{\varepsilon} \times \mathrm{Prob}(\mathcal{K}(\mathbf{x}^{(i)}) \in S).$$

In a differentially private mechanism, the basic contract between the mechanism and the data owner is the
promise to every user that her privacy loss is no larger than $\varepsilon$. In our framework for pricing
private data we turn this contract around. Now, privacy is lost, and Def. 21 quantifies this loss.
The contract is that the users are compensated according to their privacy loss. At an extreme,
if the mechanism is $\varepsilon$-differentially private for a tiny $\varepsilon$, then each user will receive only a tiny
micro-payment $W_i(\varepsilon)$; as her privacy loss increases, she will be compensated more.
Notice how micro-payments circumvent a fundamental limitation of differential-privacy
mechanisms. In differential privacy, the buyer has a fixed budget $\varepsilon$ for all queries that he may ever ask.
In order to issue $N$ queries, he needs to divide the privacy budget among these queries, and, as a
result, each query will be perturbed with a higher noise; after issuing these $N$ queries, he can no
longer query the database, because otherwise the contract with the data owner would be breached.
In our pricing framework there is no such limitation, because the buyer simply pays for each query.
The budget is now a real dollar budget, and the buyer can ask as many queries as he wants, with as
high accuracy as he wants, as long as he has money to pay for them.



6.2 Balanced Pricing Frameworks: Synthesis 



Call $(\varepsilon, \mu, \mathbf{W})$ semi-balanced if all micro-payment functions are fair, micro arbitrage-free, and
compensating w.r.t. $\mathcal{K}$; that is, we leave out the pricing function $\pi$ and the cost-recovering requirement.
The first step is to design a semi-balanced set of micro-payment functions.

Proposition 31. Let $\mathcal{L}$ be the Laplacian Mechanism, and let the contract functions be linear,
$W_i(\varepsilon_i) = c_i \cdot \varepsilon_i$, where $c_i > 0$ is a fixed constant, for $i = 1, \ldots, n$. Define the micro-payment
functions

$$\mu_i(Q) = \frac{\gamma \cdot c_i}{\sqrt{v/2}} \cdot |q_i|, \quad \text{for } i = 1, \ldots, n.$$

Then $(\varepsilon, \mu, \mathbf{W})$ is semi-balanced.



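Proof. Each $\mu_i$ is fair, because $q_i = 0$ implies $\mu_i = 0$. By setting $w_i = 2\gamma^2 \cdot c_i^2$ and $w_j = 0$ for $j \neq i$
in Eq. 4, the function $\pi_i(Q) = 2\gamma^2 c_i^2 \cdot q_i^2 / v$ is arbitrage-free. By Corollary 17, the function
$\mu_i(Q) = (\pi_i(Q))^{1/2}$ is also arbitrage-free, which means that $\mu_i$ is micro arbitrage-free. Finally, by Prop. 27
we have $W_i(\varepsilon_i(\mathcal{L}_Q)) = c_i \cdot \varepsilon_i(\mathcal{L}_Q) \le c_i \cdot \frac{\gamma}{\sqrt{v/2}} |q_i| = \mu_i(Q)$, proving that $\mu_i$ is compensating. □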



Next, we show how to derive new semi-balanced micro-payments from existing ones.

Proposition 32. Suppose that $(\varepsilon, \mu^j, \mathbf{W}^j)$ is semi-balanced, for $j = 1, \ldots, k$ (where $\mu^j = (\mu^j_1, \ldots, \mu^j_n)$
and $\mathbf{W}^j = (W^j_1, \ldots, W^j_n)$, for $j = 1, \ldots, k$), and let $f_i : (\mathbb{R}^+)^k \to \mathbb{R}^+$, $i = 1, \ldots, n$, be $n$ non-decreasing,
sub-additive functions s.t. $f_i(\mathbf{0}) = 0$, for all $i = 1, \ldots, n$. Define $\mu_i = f_i(\mu^1_i, \ldots, \mu^k_i)$ and $W_i =
f_i(W^1_i, \ldots, W^k_i)$, for each $i = 1, \ldots, n$. Then $(\varepsilon, \mu, \mathbf{W})$ is also semi-balanced, where $\mu = (\mu_1, \ldots, \mu_n)$
and $\mathbf{W} = (W_1, \ldots, W_n)$.

Proof. First, we prove fairness for $\mu_i$: if $q_i = 0$, then $\mu^1_i(Q) = \ldots = \mu^k_i(Q) = 0$ because, by
assumption, each $\mu^j_i$ is fair. Hence, $f_i(\mu^1_i(Q), \ldots, \mu^k_i(Q)) = 0$ because $f_i(\mathbf{0}) = 0$. Next, by
Prop. 16, each $\mu_i$ is arbitrage-free. Finally, each $\mu_i$ is compensating for $W_i$, because the
functions $f_i$ are non-decreasing, and each $\mu^j_i$ is compensating for $W^j_i$, hence $f_i(\mu^1_i(Q), \ldots, \mu^k_i(Q)) \ge
f_i(W^1_i(\varepsilon_i(\mathcal{K}_Q)), \ldots, W^k_i(\varepsilon_i(\mathcal{K}_Q))) = W_i(\varepsilon_i(\mathcal{K}_Q))$. □

We can use this proposition to design micro-payment functions that allow the true private data
of an individual to be disclosed, as in Sect. 3.5. We illustrate this with an example.

Example 33. Consider Example 3, where several voters give a rating in $\{0, 1, 2, 3, 4, 5\}$ to each
of two candidates A and B. Thus, $x_1, x_2$ represent the ratings of voter 1, $x_3, x_4$ of voter 2, etc.
Suppose voter 1 values her privacy highly, and would never accept a total disclosure: we choose
linear contract functions $W_1(\varepsilon) = W_2(\varepsilon) = c \cdot \varepsilon$ for her two votes, and define the micro-payments
as in Prop. 31, $\mu_i(Q) = \frac{5 \cdot c}{\sqrt{v/2}} |q_i|$ for $i = 1, 2$. On the other hand, voter 2 is less concerned about
her privacy, and is willing to sell the true values of her votes, at some high price $d > 0$: then we
choose bounded contract functions $W_3(\varepsilon) = W_4(\varepsilon) = 2 \cdot d/\Pi \cdot \mathrm{atan}(\varepsilon)$ (which is sub-additive, by
Corollary 20), and define the micro-payments accordingly, $\mu_i(Q) = 2 \cdot d/\Pi \cdot \mathrm{atan}\big(\frac{5}{\sqrt{v/2}} |q_i|\big)$, for
$i = 3, 4$. By Prop. 32 this function is also compensating and micro arbitrage-free, and, moreover,
it is bounded by $\mu_i \le d$, where the upper bound $d$ is reached by the total-disclosure query ($v = 0$).
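
The two kinds of micro-payment in this example can be sketched as follows. This is an illustration under assumptions: $\gamma = 5$ (ratings lie in $\{0, \ldots, 5\}$), the default values of `c` and `d` are arbitrary, and the function names are hypothetical.

```python
import math

GAMMA = 5.0  # assumed bound on a rating, since ratings lie in {0, ..., 5}

def linear_payment(q_i, v, c):
    """Linear micro-payment gamma * c * |q_i| / sqrt(v/2); unbounded as v -> 0."""
    if q_i == 0:
        return 0.0
    if v == 0:
        return math.inf  # total disclosure is never affordable for this owner
    return GAMMA * c * abs(q_i) / math.sqrt(v / 2.0)

def bounded_payment(q_i, v, d):
    """Bounded micro-payment (2 d / pi) * atan(gamma * |q_i| / sqrt(v/2))."""
    if q_i == 0:
        return 0.0
    loss = math.inf if v == 0 else GAMMA * abs(q_i) / math.sqrt(v / 2.0)
    return (2.0 * d / math.pi) * math.atan(loss)

def micro_payments(q, v, c=0.1, d=50.0):
    """Payments for items x1..x4: voter 1 is paid linearly, voter 2 is capped at d."""
    return [linear_payment(q[0], v, c), linear_payment(q[1], v, c),
            bounded_payment(q[2], v, d), bounded_payment(q[3], v, d)]
```

For a query touching only $x_1$ and $x_3$ with $v = 2$ (so $\sqrt{v/2} = 1$), voter 1 receives $5c$ for her first vote, while voter 2's payment for her first vote stays strictly below $d$ and reaches $d$ only at $v = 0$.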

Finally, we choose a payment function so as to ensure that the micro-payments are cost-recovering.

Proposition 34. (1) Suppose that $(\varepsilon, \mu, \mathbf{W})$ is semi-balanced, and define $\pi(Q) = \sum_i \mu_i(Q)$. Then
$(\pi, \varepsilon, \mu, \mathbf{W})$ is balanced.

(2) Suppose that $(\pi, \varepsilon, \mu, \mathbf{W})$ is balanced and $\pi' \ge \pi$ is any arbitrage-free pricing function. Then
$(\pi', \varepsilon, \mu, \mathbf{W})$ is also balanced.

Proof. Claim (1) follows from Corollary 17 (the sum of arbitrage-free functions is also
arbitrage-free), while claim (2) is straightforward. □

To summarize, the synthesis procedure for a pricing framework proceeds as follows. Start with
the simple micro-payment functions given by Prop. 31, which ensure a linear compensation for each
user. Next, modify both the micro-payment and the contract functions using Prop. 32 as desired,
in order to adjust to the preferences of individual users, for example, in order to allow a user to
set a price for her true data. Finally, define the query price to be the sum of all micro-payments
(Prop. 34), then increase this price freely, by using any method in Corollary 17.
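
The first and last steps of this procedure can be sketched in a few lines: linear micro-payments as in Prop. 31, and a price defined as their sum as in Prop. 34. The helper names (`make_framework`, `combine`, `worst_arbitrage_gap`) and the random test harness are assumptions made for illustration; the harness only searches for arbitrage opportunities among random linear combinations, it proves nothing.

```python
import math
import random

def make_framework(cs, gamma):
    """Return (mu, price): linear micro-payments and their sum as the query price."""
    def mu(i, q, v):
        return gamma * cs[i] * abs(q[i]) / math.sqrt(v / 2.0)

    def price(q, v):
        return sum(mu(i, q, v) for i in range(len(cs)))

    return mu, price

def combine(queries, coeffs):
    """Derived query (sum_j c_j q_j, sum_j c_j^2 v_j) obtainable from the answers."""
    n = len(queries[0][0])
    q = [sum(c * qj[i] for c, (qj, _) in zip(coeffs, queries)) for i in range(n)]
    v = sum(c * c * vj for c, (_, vj) in zip(coeffs, queries))
    return q, v

def worst_arbitrage_gap(price, n, rng, trials=1000, m=3):
    """Largest sampled price(derived) - sum of price(ingredients)."""
    worst = -math.inf
    for _ in range(trials):
        parts = [([rng.uniform(-1, 1) for _ in range(n)], rng.uniform(0.1, 2.0))
                 for _ in range(m)]
        coeffs = [rng.uniform(-2, 2) for _ in range(m)]
        q, v = combine(parts, coeffs)
        worst = max(worst, price(q, v) - sum(price(qj, vj) for qj, vj in parts))
    return worst

mu, price = make_framework(cs=[0.1, 0.2, 0.3], gamma=5.0)
gap = worst_arbitrage_gap(price, n=3, rng=random.Random(0))
```

By construction the price equals the sum of the micro-payments, so the framework is cost-recovering with equality; a non-positive `gap` is consistent with the price being arbitrage-free.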



7 Discussion 



In this section, we discuss two problems in pricing private data, and show how they affect our
pricing framework. The first is how to incentivize data owners to participate in the database
and truthfully report their privacy valuations, which are reflected in their contract functions $W_i$; this
property is called truthfulness in mechanism design. The second concerns protection of the privacy
valuations themselves, meaning that the contract $W_i$ may also leak information to the buyer.



7.1 Truthfulness 

How can we incentivize a user to participate, and to reveal her true value for the privacy loss of a
data item $x_i$? All things being equal, the data owner will quote an impossibly high price, for even
a tiny loss of her privacy. In other words, she would choose a contract function $W(\varepsilon)$ that is as
close to $\infty$ as possible.

Incentivizing users to report their true valuation is the topic of mechanism design. This has
been studied for private data only in the restricted case of a single query, and has been shown to be
a difficult task. Ghosh and Roth [15] basically show that if the privacy valuations are sensitive,
then it is impossible to design truthful and individually rational direct revelation mechanisms. [13]
circumvents this impossibility result by assuming that the privacy valuation is drawn from known
probability distributions. Also, according to some experimental studies [1], the owner's valuation is
often complicated, and difficult for the owner to articulate; moreover, different people might have quite
different valuations. Indeed, without a context or reference, it is hard for people to understand the
valuation of their private data. The design of a truthful and private mechanism for private data,
even for a single query, is still an active research topic.

Instead, we propose a simpler approach, adopted directly from that introduced by Aperjis and
Huberman [2]. Instead of asking for their valuations, users are given a fixed number of options.
For example, the users may be offered a choice between two contract functions, shown in Fig. 2,
which we call (following [2]) Options A and B:
Option A: $W$ is a step function: for modest privacy losses, there is no micro-payment, but for
significant privacy losses there is a significant payment.

Option B: $W$ is a slowly-increasing sigmoid, or arctangent: there is a non-zero micro-payment for
even the smallest privacy losses, but even the maximal payment is much lower than that of
Option A.

While these options were initially designed for a sampling-based query answering mechanism [2],
they also work for our perturbation-based mechanism. Risk-neutral users will typically choose
Option A, while risk-averse users will choose Option B. Clearly, a good user interface will offer
more than two options; designing a set of options that users can easily understand is a difficult
task, which we leave to future work.

[Plot omitted: the two contract functions, Options A and B, against the privacy loss $\varepsilon$ on the horizontal axis (0.0 to 0.8).]


Figure 2: Two options for the contract function W. Option A makes no micro-payments for small 
privacy losses, and makes a large payment for large privacy losses. Option B pays even for small 
amounts of privacy losses, but for large privacy losses pays less than A. Risk-neutral users typically 
choose Option A, while risk-averse users choose Option B. 

7.2 Private Valuations 

When users have sufficient freedom to choose their privacy valuation (i.e. their contract function
$W_i$), then we may face a quite difficult problem: the privacy valuation may be strongly correlated
with the data $x_i$ itself; thus, just the price of a query may lead to privacy loss. For example,
consider a database of HIV status: $x_i = 1$ means that data owner $i$ has HIV, $x_i = 0$ means that
she does not. Typically, users who have HIV will set a much higher value on the privacy of their $x_i$
than those who don't have HIV. For example, users without HIV may ask for \$1 for $x_i$, while users
who do have HIV may ask for \$1000. Then, a savvy buyer may simply ask for the price of a query,
without actually purchasing the query, and determine with some reasonable confidence if a user has
HIV or not. Hiding the valuation itself is a difficult problem, which is still being actively researched
in mechanism design [13].

If the price itself is private, then inquiries about prices need to be perturbed in the same fashion
as queries on the data. Thus, the price $\pi(Q)$ and the micro-payments $\mu_i(Q)$ need to be random
variables. Queries are answered using a mechanism $\mathcal{K}$, while prices are computed using a, possibly
different, mechanism $\mathcal{K}'$. We show, briefly, that, if the contract functions are linear, $W_i = c_i \cdot \varepsilon_i$,
then it is possible to extend our pricing framework to ensure that data owners are compensated
both for their privacy loss of the data and their privacy loss of the price. The properties of
arbitrage-freeness, cost-recovering, and compensation are now defined in terms of expected values;
for example, a randomized price function $\pi(Q)$ is arbitrage-free if $\{Q_1, \ldots, Q_m\} \to Q$ implies

$$\mathbf{E}(\pi(Q)) \le \sum_{i=1}^m \mathbf{E}(\pi(Q_i)).$$

Now the privacy loss for data item $x_i$ includes two parts. One part is due to the release of the
query answer, and the other part is due to the release of the price. Their values are $\varepsilon_i(\mathcal{K})$ and
$\varepsilon_i(\mathcal{K}')$ respectively. A micro-payment is compensating if $\mathbf{E}(\mu_i(Q)) \ge c_i \cdot (\varepsilon_i(\mathcal{K}) + \varepsilon_i(\mathcal{K}'))$.

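As for the data items, we assume that the constants $c_i$ used in the contract function are drawn
from a bounded domain $\mathcal{Y} \subseteq \mathbb{R}$, and denote $\delta = \sup_{c \in \mathcal{Y}} |c|$ (in analogy to $\gamma$ defined in Sect. 4).
Assume that both $\mathcal{K}$ and $\mathcal{K}'$ are Laplacian mechanisms. Given a query $Q = (\mathbf{q}, v)$, set $b = \sqrt{v/2}$ and
choose some$^5$ $b' > \delta$, tunable by the market maker. $\mathcal{K}$ is the mechanism that, on an input $\mathbf{x}$,
returns $\mathbf{q}(\mathbf{x}) + \rho$, where $\rho$ is a noise with distribution $\mathrm{Lap}(b)$. $\mathcal{K}'$ is the mechanism that, on an
input $\mathbf{c}$, returns a noisy price $\frac{\gamma b'}{b \cdot (b' - \delta)} \sum_i c_i \cdot |q_i| + \rho'$, where $\rho'$ is a noise with distribution $\mathrm{Lap}(b')$.
We denote the exact price, $\frac{\gamma b'}{b \cdot (b' - \delta)} \sum_i c_i \cdot |q_i|$, as $\mathbf{E}(\mathcal{K}'(\mathbf{c}))$. The sensitivity of the mechanism $\mathcal{K}$ is
$s_i(\mathcal{K}) = \gamma \cdot |q_i|$ (Def. 25). If we define $s_i(\mathcal{K}') = \frac{\gamma b' \delta \cdot |q_i|}{b \cdot (b' - \delta)}$, then we can prove that:

$$\varepsilon_i(\mathcal{K}) \le \frac{s_i(\mathcal{K})}{b}, \qquad \varepsilon_i(\mathcal{K}') \le \frac{s_i(\mathcal{K}')}{b'}.$$

We prove the following in the Appendix: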

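Proposition 35. Let $\mathcal{K}, \mathcal{K}'$ be Laplacian mechanisms (as described above) and $Q = (\mathbf{q}, v)$ be a
query. Set (as above) $b = \sqrt{v/2}$ and $b' > \delta$. Define:

$$\pi(Q) = \mathcal{K}'(\mathbf{c}) = \mathbf{E}(\mathcal{K}'(\mathbf{c})) + \rho'$$

$$\mu_i(Q) = \left(\frac{s_i(\mathcal{K})}{b} + \frac{s_i(\mathcal{K}')}{b'}\right) \cdot c_i + \frac{\pi(Q) - \mathbf{E}(\mathcal{K}'(\mathbf{c}))}{n}, \quad \forall i = 1, \ldots, n$$

Then $(\pi, \mu, \varepsilon, \mathbf{W})$ is a balanced mechanism.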
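
A sketch of the randomized price and micro-payments follows. It rests on explicit assumptions: the price noise is split equally among the $n$ owners, $s_i(\mathcal{K}) = \gamma |q_i|$, and $s_i(\mathcal{K}') = \gamma b' \delta |q_i| / (b (b' - \delta))$, which is the value that makes the payments sum to the price in the cost-recovery argument of Appendix B. The function names are hypothetical.

```python
import math
import random

def laplace(scale, rng):
    """Draw from Lap(scale) as the difference of two exponential variables."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_framework(q, v, cs, gamma, delta, b_prime, rng):
    """Return (price, payments) for one query under the assumptions stated above."""
    if b_prime <= delta:
        raise ValueError("b' must exceed delta")
    b = math.sqrt(v / 2.0)
    n = len(q)
    factor = gamma * b_prime / (b * (b_prime - delta))
    exact_price = factor * sum(c * abs(qi) for c, qi in zip(cs, q))
    price = exact_price + laplace(b_prime, rng)  # noise rho' with scale b'
    payments = []
    for c, qi in zip(cs, q):
        s_k = gamma * abs(qi)                    # assumed s_i(K)
        s_kp = factor * delta * abs(qi)          # assumed s_i(K')
        payments.append((s_k / b + s_kp / b_prime) * c
                        + (price - exact_price) / n)
    return price, payments
```

Because the noise term is shared, an individual payment can be negative for a single draw; the fairness, arbitrage-freeness and compensation properties are stated in expectation only.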



8 Related Work 

The study of the tradeoff between privacy and utility in statistical databases was initiated by Dinur
and Nissim [10], and culminated in [12], where Dwork, McSherry, Nissim and Smith introduced
differential privacy and the Laplacian mechanism. The goal of this line of research is to reveal
accurate statistics while preserving the privacy of the individuals. There have been two (somewhat
artificially divided) models involved: the non-interactive model, e.g. [4, 21, 23, 9], and the
interactive model, e.g. [30], [17]. In this paper, we use the interactive model, in which the queries arrive
on-line (one at a time) and the market maker has to answer them and charge for them interactively.
There is a vast and growing literature on differential privacy; we refer the readers to the recent
survey by Dwork [11]. There is privacy loss in releasing statistics in a differentially private sense
(quantified in terms of the privacy parameter/budget $\varepsilon$). However, this line of research does not
consider compensating the privacy loss.

Ghosh and Roth [15] initiated a study of how to incentivize individuals to contribute their
private data and to truthfully report their privacy valuation using tools of mechanism design. They
consider the same problem as we do (namely pricing private data) but from a different perspective
and using a different approach: they fix the query, assume that individuals can have arbitrary
privacy valuation, and try to design a truthful mechanism. On the other hand, we assume that the
queries arrive online and that individuals have a fixed privacy valuation (stored in the database),
and focus on charging the buyers and compensating the individuals in a consistent and principled
manner. Another key difference between our work and [15] is that we require not only accuracy but
also unbiasedness for the noisy answer to a certain query, while in [15] answers are not unbiased.
There have been some follow-ups to [15], e.g. [HI I2H Ell EJ. Please refer to the survey [29] for
details.

5 When $b' < \delta$, the expectation of the price $\pi$ is infinite.

In the database community, Balazinska, Howe and Suciu [3] initiated a study of data markets in 
the cloud. Subsequently, [20] proposed a data pricing method which first sets explicit price points 
on a set of views and then computes the implied price for any query. However, they did not consider 
the potential privacy risks of their method. The query determinacy used in [20] is instance-based, 
and as a result, the adversary could (in the worst case) learn the entire database solely by asking 
the prices of queries (for free). Li and Miklau study data pricing for linear aggregation queries [22] 
using a notion of instance-independent query determinacy. This avoids some privacy risks, but it 
is still sometimes possible to infer query answers for which the buyer has not paid. Both of the 
above works consider a model in which unperturbed query answers are exchanged for payment. 
In this paper we consider noisy query answers and use an instance-independent notion of query 
determinacy, which allows us to formally model private disclosures and assign prices accordingly. 

Aperjis and Huberman [2] describe a simple strategy to collect private data from individuals and
compensate them, based on an assumption in sociology that some people are risk averse. By doing
so, buyers could compensate individuals with relatively less money. More specifically, a buyer may
access the private data of an individual with probability 0.2, and offer her two choices: under the
first, she is paid $10 if the data is accessed and receives nothing otherwise; under the second, she
receives $1 regardless of whether her data is used or not. A risk-averse person may then choose the
second option, and consequently the buyer saves $1 in expectation (0.2 x $10 = $2 versus $1). In their
paper, the private data of an individual is either entirely exposed, or completely unused. In our framework, there
are different levels of privacy, the privacy loss is carefully quantified and compensated, and thus
the data is better protected. Finally, Riederer et al. [28] propose auction methods to sell private
data to aggregators, but an owner's data is either completely hidden or totally disclosed and the
price of data is ultimately determined by buyers without consideration of owners' personal privacy
valuations.

9 Conclusions 

We have introduced a framework for selling private data. Buyers can purchase any linear query, with
any amount of perturbation, and need to pay accordingly; data owners, in turn, are compensated
according to the privacy loss that they incur for each query. In our framework buyers are allowed
to ask an arbitrary number of queries, and we have designed techniques for ensuring that the prices
are arbitrage-free, meaning that buyers are guaranteed to pay for any information they may further
extract from the queries. Our pricing framework is balanced, in the sense that the buyer's price
covers the micro-payments to the data owners, and each micro-payment compensates the users
according to their privacy loss.
A Proof of Corollary 20

By Lemma 18 it suffices to check that all first derivatives are $\ge 0$ and all second derivatives are
$\le 0$, for all $x \ge 0$:

$$\frac{d}{dx} \mathrm{atan}(x) = \frac{1}{1 + x^2} > 0; \qquad \frac{d^2}{dx^2} \mathrm{atan}(x) = -\frac{2x}{(1 + x^2)^2} \le 0;$$

$$\frac{d}{dx} \tanh(x) = \frac{1}{\cosh^2(x)} > 0; \qquad \frac{d^2}{dx^2} \tanh(x) = -\frac{2\tanh(x)}{\cosh^2(x)} \le 0;$$

$$\frac{d}{dx} \frac{x}{\sqrt{1 + x^2}} = (1 + x^2)^{-3/2} > 0; \qquad \frac{d^2}{dx^2} \frac{x}{\sqrt{1 + x^2}} = -3x(1 + x^2)^{-5/2} \le 0.$$



B Proof of Prop. 35

We show that each $\mu_i$ is fair in expectation. For individual $i$, if $q_i = 0$, then by definition, $s_i(\mathcal{K}) = 0$
and $s_i(\mathcal{K}') = 0$, and thus

$$\mathbf{E}(\mu_i(\mathbf{q}, v)) = \left(\frac{s_i(\mathcal{K})}{b} + \frac{s_i(\mathcal{K}')}{b'}\right) \times c_i = 0.$$

We show that $\mu_i$ is micro arbitrage-free in expectation. For each individual $i$, by definition,

$$\mathbf{E}(\mu_i(Q)) = \frac{\gamma b' \cdot c_i \cdot |q_i|}{b \cdot (b' - \delta)}.$$

By the same argument as in Prop. 31, $\mathbf{E}(\mu_i(Q))$ is arbitrage-free, and thus $\mu_i(Q)$ is
arbitrage-free in expectation.

We show that the micro-payments are cost recovering. By definition,

$$\sum_i \mu_i(Q) = \sum_i \left(\frac{s_i(\mathcal{K})}{b} + \frac{s_i(\mathcal{K}')}{b'}\right) \times c_i + \rho'$$

$$= \sum_i \left(\frac{\gamma |q_i|}{b} + \frac{\gamma \delta |q_i|}{b \cdot (b' - \delta)}\right) \cdot c_i + \rho'$$

$$= \frac{\gamma b'}{b \cdot (b' - \delta)} \sum_i c_i \cdot |q_i| + \rho'$$

$$= \pi(Q),$$

proving the claim.

Finally, we show that $\mu_i$ is compensating, in expectation: For each individual $i$,

$$\mathbf{E}(\mu_i(Q)) = \left(\frac{s_i(\mathcal{K})}{b} + \frac{s_i(\mathcal{K}')}{b'}\right) \times c_i \ge (\varepsilon_i(\mathcal{K}) + \varepsilon_i(\mathcal{K}')) \times c_i,$$

meaning that $\mu_i(Q)$ compensates user $i$ for her loss of privacy in expectation.
By a similar argument as in Prop. 34, $\pi(Q)$ is arbitrage-free in expectation.